Add RL token throughput and packing metrics #3877
Conversation
Co-authored-by: Jorge Albericio <jalbericiola@nvidia.com>
    Returns:
        Total compute tokens (num_bins * bin_size) on this rank.
    """
    if packing_context is None or packing_context.packed_trajs is None:
Your typing says that PackingContext cannot be None
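A minimal sketch of the mismatch being pointed out, assuming a hypothetical helper shaped like the diff above (the class, fields, and function name here are illustrative stand-ins, not the real Megatron code): annotating the parameter as `Optional[PackingContext]` makes the `is None` check consistent with the typing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PackingContext:
    # Stand-in for the real PackingContext; only the fields used below.
    packed_trajs: Optional[list] = None
    num_bins: int = 0
    bin_size: int = 0


def get_packing_compute_tokens(packing_context: Optional[PackingContext]) -> int:
    """Return total compute tokens (num_bins * bin_size) on this rank."""
    # With Optional[...] in the signature, this None check agrees with
    # the annotation instead of contradicting it.
    if packing_context is None or packing_context.packed_trajs is None:
        return 0
    return packing_context.num_bins * packing_context.bin_size
```

The alternative fix, of course, is to keep the non-Optional annotation and drop the `packing_context is None` branch.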
/claude review
megatron/training/training.py
Outdated
    # Add tokens/sec to log string
    log_string += f' toks/s: {tokens_per_sec:.0f} |'
    log_string += f' toks/s/gpu: {tokens_per_sec_per_gpu:.0f} |'
compute_tokens is assigned here but never used. Was this intended for something (e.g., a log line or the packing_efficiency calculation)? If not, it should be removed to avoid confusion.
Suggested change:
    log_string += f' toks/s/gpu: {tokens_per_sec_per_gpu:.0f} |'
    actual_tokens = rl_utils.get_packing_actual_tokens(runtime_state.packing_context)
megatron/training/training.py
Outdated
    packing_efficiency = rl_utils.get_packing_efficiency(runtime_state.packing_context)

    # Add tokens/sec to log string
    log_string += f' toks/s: {tokens_per_sec:.0f} |'
Is this going to add this metric to the log for all training? I'm not sure we use this metric a lot in pretraining, so I'm nervous it might just be adding noise to the log.
I've moved all the extra metrics in training.py into a single if-block guarded by args.perform_rl_step; does that look good?
Force-pushed from b215575 to 2b2a0d3
    self.sequences_this_iteration_on_rank = 0
    self.latest_batch_num_sequences = 0
    # Derived throughput metrics (set by training_log, read by RLProfiler)
    self.tokens_per_sec = None
Please add field descriptions here.
    self.tokens_per_sec = None
    self.tokens_per_sec_per_gpu = None
    self.actual_tokens_per_sec = None
    self.actual_tokens_per_sec_per_gpu = None
Do we actually need _per_gpu variables here? How about we store tokens/actual_tokens and world_size and have a method that does the actual division?
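A rough sketch of that suggestion, with illustrative names (not the actual runtime-state class in this PR): store each global rate once plus `world_size`, and derive the per-GPU value in a method instead of keeping four separate fields.

```python
class ThroughputState:
    """Illustrative stand-in for the RL runtime state discussed above."""

    def __init__(self, world_size: int):
        self.world_size = world_size
        # Only the global rates are stored; per-GPU variants are derived.
        self.tokens_per_sec = None
        self.actual_tokens_per_sec = None

    def per_gpu(self, value):
        """Divide a global rate by world_size; None passes through unchanged."""
        if value is None:
            return None
        return value / self.world_size
```

A caller would then log `state.per_gpu(state.tokens_per_sec)` rather than reading a second stored field, so the two values can never drift out of sync.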
    log_string += ' number of nan iterations: {:3d} |'.format(total_loss_dict[nan_iters_key])

    # RL token throughput metrics.
    if args.perform_rl_step:
Should we move this to a function in the RL folder? training.py is becoming unreadable.
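One way that extraction could look, as a hedged sketch (the function name and its location, e.g. somewhere under the RL utilities rather than training.py, are hypothetical):

```python
def append_rl_throughput_to_log(log_string: str,
                                tokens_per_sec: float,
                                tokens_per_sec_per_gpu: float) -> str:
    """Build the RL throughput portion of the training log line,
    keeping training_log() itself short."""
    log_string += f' toks/s: {tokens_per_sec:.0f} |'
    log_string += f' toks/s/gpu: {tokens_per_sec_per_gpu:.0f} |'
    return log_string
```

training_log() would then call this helper inside the `args.perform_rl_step` guard, so all RL-specific formatting lives in one place.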
What does this PR do?

Contribution process

Pre-checks

Code review

Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS. Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review
For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned. For PRs outside megatron/core, this step is skipped.

Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.

Merge
Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.